
Let's begin by listing the dataset's variables.

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"

Now I need more detail about the types of the variables in the dataset.

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

I can see that this dataset has 1599 observations of 13 variables. All variables are of type num except X and quality, which are of type int.
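The inspection step above can be sketched as follows. This is a minimal example on a toy data frame built from the first 10 rows shown in the `str()` output; the real analysis would read the full file, whose path is not shown here, so the `wine` object below is only a stand-in.

```r
# Toy stand-in for the wine data: first 10 rows of three columns,
# copied from the str() output above (not the full 1599-row dataset).
wine <- data.frame(
  X = 1:10,
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

names(wine)  # variable names
str(wine)    # dimensions and the type of each column

# Collect the class of every column to confirm which are integer vs numeric.
col_types <- sapply(wine, class)
print(col_types)
```

On the full dataset the same calls should report 1599 observations of 13 variables.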

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

This summary shows that X is just a row number or identifier for each observation, so it has no effect on the quality of the red wine and we can ignore it.

The quality is an ordered, discrete variable.

75% of the red wines have a quality less than or equal to 6.

The other variables are continuous variables.

The median fixed.acidity is 7.90. The max volatile.acidity is 1.58. The median pH is 3.31.
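A small sketch of how the identifier column could be dropped and the quartile statement checked. It uses a toy frame with the first 10 rows from the `str()` output, so the numbers it prints are for that sample only and will differ from the full-data summary above.

```r
# Toy sample: first 10 rows of X, pH and quality from the str() output.
wine <- data.frame(
  X = 1:10,
  pH = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# X only numbers the rows, so remove it before summarising.
wine$X <- NULL
summary(wine)

# Share of wines with quality <= 6, and the third quartile of quality.
share_le_6 <- mean(wine$quality <= 6)
q3 <- quantile(wine$quality, 0.75)
print(share_le_6)
print(q3)
```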

Univariate Plots

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

It seems like quality has a roughly normal distribution over its discrete values (3, 4, 5, 6, 7 and 8).

Quality 5 is the most common value, closely followed by quality 6; together they account for 1319 of the 1599 wines. Then come 7, 4, 8, and finally 3 with the fewest wines.

We can group quality into 3 categories (bad, fair and good) by creating a new categorical variable called quality_rating.

##  bad fair good 
##   63 1319  217

Here we have 63 bad wines, 1319 fair wines and 217 good wines.

Fair is by far the most common quality rating.
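One way to build such a rating is with `cut()`. This is a sketch, not necessarily the code used for the table above: the `quality` vector is a hypothetical sample, and the break points assume the grouping bad = 3 and 4, fair = 5 and 6, good = 7 and 8.

```r
# Hypothetical sample of quality scores covering every observed value.
quality <- c(3L, 4L, 5L, 5L, 6L, 6L, 7L, 8L)

# cut() uses right-closed intervals by default: (2,4], (4,6], (6,8].
quality_rating <- cut(
  quality,
  breaks = c(2, 4, 6, 8),
  labels = c("bad", "fair", "good"),
  ordered_result = TRUE  # keep the natural ordering bad < fair < good
)

table(quality_rating)
```

Making the factor ordered keeps the categories in a sensible order in later plots and tables.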

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

volatile.acidity has a long-tailed distribution. Let's transform it using a log10 scale.

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

On the log10 scale, volatile.acidity looks approximately normally distributed.
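A minimal sketch of the log10 transformation. The `stat_bin` messages suggest the original plots were made with ggplot2; this example uses base R `hist()` instead to stay dependency-free, and only the first 10 volatile.acidity values from the `str()` output.

```r
# First 10 volatile.acidity values from the str() output (toy sample).
volatile_acidity <- c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)

# Transform to the log10 scale; all values are positive, so this is safe.
log_va <- log10(volatile_acidity)

# Histogram of the transformed values; hist() also returns the bin counts.
h <- hist(
  log_va,
  breaks = 10,
  main = "volatile.acidity (log10 scale)",
  xlab = "log10(volatile.acidity)"
)
print(h$counts)
```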

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

fixed.acidity has a long tailed distribution.

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

Displaying fixed.acidity on a log10 scale reveals an approximately normal distribution.

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

## 
## FALSE  TRUE 
##  1467   132

As we can see, most wines have citric.acid between 0 and 0.5. 132 red wines have a citric.acid value of 0.

citric.acid is not normally distributed.

Now I'll create a new variable representing the total fixed acids of the wine (fixed.acidity + citric.acid). Let's call it total.fixed.acids.
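The zero count and the new variable can be sketched like this, again on the first 10 rows from the `str()` output, so the counts are for this toy sample and not the full dataset.

```r
# First 10 rows of the two acid columns from the str() output (toy sample).
wine <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  citric.acid   = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)
)

# How many wines have exactly zero citric acid?
table(wine$citric.acid == 0)

# New variable: sum of the two fixed acids.
wine$total.fixed.acids <- wine$fixed.acidity + wine$citric.acid
head(wine)
```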

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

total.fixed.acids has a long-tailed distribution.

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

Plotting total.fixed.acids on a log10 scale reveals an approximately normal distribution.

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

residual.sugar has a heavy-tailed distribution with a lot of outliers.

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

Even on a log10 scale, residual.sugar keeps this heavy-tailed distribution.

I thought of creating a new variable classifying red wines into 2 categories (sweet and non-sweet), but the max value of residual.sugar in the dataset is 15.5, and a wine is only considered sweet if it has at least 45 residual.sugar. So no wine in this dataset would count as sweet.

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

The last plot shows that most wines have a chlorides value below 0.2. It also shows that chlorides, like residual.sugar, has a heavy-tailed distribution with many outliers.

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

The distribution is not clear. I need to adjust the binwidth to get a better visualization.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

Adjusting the binwidth reveals that free.sulfur.dioxide has a right-skewed distribution.
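A sketch of setting the bin width explicitly rather than relying on the default of range/30. In ggplot2 this corresponds to the `binwidth` argument the message mentions; the base R version below builds the breaks by hand. The bin width of 2 is an assumption for illustration, and the data are the first 10 free.sulfur.dioxide values from the `str()` output.

```r
# First 10 free.sulfur.dioxide values from the str() output (toy sample).
free_so2 <- c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17)

# Assumed bin width; tune it until the shape of the distribution is readable.
binwidth <- 2

# Equally spaced breaks that cover the whole range of the data.
breaks <- seq(floor(min(free_so2)), ceiling(max(free_so2)) + binwidth, by = binwidth)

h <- hist(
  free_so2,
  breaks = breaks,
  main = "free.sulfur.dioxide",
  xlab = "free.sulfur.dioxide"
)
print(h$counts)
```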

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

total.sulfur.dioxide has a right skewed distribution like free.sulfur.dioxide.

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## Warning: position_stack requires constant width: output may be incorrect

density appears to be normally distributed.

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

pH has an approximately normal distribution with a few outliers.

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

sulphates has a long tailed distribution.

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

Plotting sulphates on a log10 scale shows an approximately normal distribution.

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

alcohol has a right skewed distribution.

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

Even on a log10 scale, alcohol does not look normally distributed.

Univariate Analysis

What is the structure of your dataset?

The dataset contains 1599 red wines with 13 features (X, fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol and quality). All variables are num except X and quality, which are int.

Other observations: quality is a discrete variable while all others are continuous variables.

The quality variable has a roughly normal distribution.

volatile.acidity, fixed.acidity and sulphates appear approximately normal when plotted on a log10 scale.

chlorides and residual.sugar have heavy-tailed distributions with a lot of outliers.

free.sulfur.dioxide and total.sulfur.dioxide have long-tailed (right-skewed) distributions.

density and pH are approximately normally distributed.

75% of the red wines have a quality less than or equal to 6.

Many wines (132) have a citric.acid value of 0.

Min quality is 3 and max quality is 8.

Median fixed.acidity is 7.90.

Max volatile.acidity is 1.58.

Median pH is 3.31.

X variable is just an identifier of the observations.

What is/are the main feature(s) of interest in your dataset?

I am very interested in the quality of red wine. I want to explore the variables affecting it.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

From googling and the variable descriptions, I think the variables below will support my investigation into the quality variable:

1- Acids [Fixed, Volatile and citric]

2- alcohol

3- pH

4- total sulfur dioxide

Did you create any new variables from existing variables in the dataset?

1- quality_rating: a categorical version of the quality variable

2- total.fixed.acids: sum of fixed and citric acids in wine

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Besides normal distributions, there were long-tailed and heavy-tailed ones. All I did with the data was set the binwidth and apply log10 transformations to get better visualizations.

Bivariate Plots Section

##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity           1.00000000     -0.256130895  0.67170343
## volatile.acidity       -0.25613089      1.000000000 -0.55249568
## citric.acid             0.67170343     -0.552495685  1.00000000
## residual.sugar          0.11477672      0.001917882  0.14357716
## chlorides               0.09370519      0.061297772  0.20382291
## free.sulfur.dioxide    -0.15379419     -0.010503827 -0.06097813
## total.sulfur.dioxide   -0.11318144      0.076470005  0.03553302
## density                 0.66804729      0.022026232  0.36494718
## pH                     -0.68297819      0.234937294 -0.54190414
## sulphates               0.18300566     -0.260986685  0.31277004
## alcohol                -0.06166827     -0.202288027  0.10990325
## quality                 0.12405165     -0.390557780  0.22637251
## total.fixed.acids       0.99704157     -0.294847154  0.72665884
##                      residual.sugar    chlorides free.sulfur.dioxide
## fixed.acidity           0.114776724  0.093705186        -0.153794193
## volatile.acidity        0.001917882  0.061297772        -0.010503827
## citric.acid             0.143577162  0.203822914        -0.060978129
## residual.sugar          1.000000000  0.055609535         0.187048995
## chlorides               0.055609535  1.000000000         0.005562147
## free.sulfur.dioxide     0.187048995  0.005562147         1.000000000
## total.sulfur.dioxide    0.203027882  0.047400468         0.667666450
## density                 0.355283371  0.200632327        -0.021945831
## pH                     -0.085652422 -0.265026131         0.070377499
## sulphates               0.005527121  0.371260481         0.051657572
## alcohol                 0.042075437 -0.221140545        -0.069408354
## quality                 0.013731637 -0.128906560        -0.050656057
## total.fixed.acids       0.121334969  0.108045144        -0.148947646
##                      total.sulfur.dioxide     density          pH
## fixed.acidity                 -0.11318144  0.66804729 -0.68297819
## volatile.acidity               0.07647000  0.02202623  0.23493729
## citric.acid                    0.03553302  0.36494718 -0.54190414
## residual.sugar                 0.20302788  0.35528337 -0.08565242
## chlorides                      0.04740047  0.20063233 -0.26502613
## free.sulfur.dioxide            0.66766645 -0.02194583  0.07037750
## total.sulfur.dioxide           1.00000000  0.07126948 -0.06649456
## density                        0.07126948  1.00000000 -0.34169933
## pH                            -0.06649456 -0.34169933  1.00000000
## sulphates                      0.04294684  0.14850641 -0.19664760
## alcohol                       -0.20565394 -0.49617977  0.20563251
## quality                       -0.18510029 -0.17491923 -0.05773139
## total.fixed.acids             -0.10127190  0.65737801 -0.68958445
##                         sulphates     alcohol     quality
## fixed.acidity         0.183005664 -0.06166827  0.12405165
## volatile.acidity     -0.260986685 -0.20228803 -0.39055778
## citric.acid           0.312770044  0.10990325  0.22637251
## residual.sugar        0.005527121  0.04207544  0.01373164
## chlorides             0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide   0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide  0.042946836 -0.20565394 -0.18510029
## density               0.148506412 -0.49617977 -0.17491923
## pH                   -0.196647602  0.20563251 -0.05773139
## sulphates             1.000000000  0.09359475  0.25139708
## alcohol               0.093594750  1.00000000  0.47616632
## quality               0.251397079  0.47616632  1.00000000
## total.fixed.acids     0.202161691 -0.04578490  0.13852654
##                      total.fixed.acids
## fixed.acidity                0.9970416
## volatile.acidity            -0.2948472
## citric.acid                  0.7266588
## residual.sugar               0.1213350
## chlorides                    0.1080451
## free.sulfur.dioxide         -0.1489476
## total.sulfur.dioxide        -0.1012719
## density                      0.6573780
## pH                          -0.6895844
## sulphates                    0.2021617
## alcohol                     -0.0457849
## quality                      0.1385265
## total.fixed.acids            1.0000000

We can see that quality has a moderate positive correlation with alcohol (0.476) and a moderate negative correlation with volatile.acidity (-0.391).

pH is strongly negatively correlated with both fixed.acidity (-0.683) and citric.acid (-0.542), which makes sense given the description of pH.

Also note that free.sulfur.dioxide correlates with total.sulfur.dioxide (0.668), which makes sense because free.sulfur.dioxide is a subset of total.sulfur.dioxide.

Finally, we can see that total.fixed.acids is correlated with fixed.acidity (0.997) and citric.acid (0.727), which is logical because total.fixed.acids is the sum of these two variables.
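A correlation matrix like the one above can be produced with `cor()`. The sketch below uses only three columns and the first 10 rows from the `str()` output, so its coefficients will not match the full-data values quoted in the text.

```r
# Toy sample: first 10 rows of three columns from the str() output.
wine <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pearson correlation between every pair of columns.
cor_mat <- cor(wine)
print(round(cor_mat, 3))

# Correlations with quality, strongest positive first (quality itself excluded).
quality_cor <- sort(cor_mat[rownames(cor_mat) != "quality", "quality"], decreasing = TRUE)
print(quality_cor)
```

Sorting the quality column makes it easier to spot which variables are worth plotting against quality first.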

This plot shows that high quality wines have high alcohol values. We can also notice the vertical strips, which indicate that quality is a discrete variable taking one of the values 3, 4, 5, 6, 7 and 8. The median alcohol increases for the higher qualities (6, 7 and 8). 75% of high quality wines have alcohol values exceeding 11, while in lower quality wines it is under 11.
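The per-quality medians and quartiles described above can be computed directly. This is a base R sketch on the first 10 rows from the `str()` output; the toy sample only contains qualities 5, 6 and 7, so it illustrates the technique rather than reproducing the result.

```r
# Toy sample: first 10 rows of alcohol and quality from the str() output.
wine <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Median and first quartile of alcohol within each quality level.
med <- tapply(wine$alcohol, wine$quality, median)
q1  <- tapply(wine$alcohol, wine$quality, quantile, probs = 0.25)
print(med)
print(q1)

# Box plot of alcohol by quality: one box per discrete quality value.
boxplot(alcohol ~ quality, data = wine, xlab = "quality", ylab = "alcohol")
```

If the first quartile of a quality level is above 11, then 75% of the wines at that level have alcohol above 11.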

High quality wines have low volatile.acidity, which matches the negative effect of high volatile.acidity levels on quality mentioned in the variable description.

Higher qualities tend to have higher values of fixed.acidity.

This plot shows a clear positive impact of citric acid on the quality.

The plot doesn't show any impact of residual.sugar on quality. The median and quartile values of residual.sugar across the qualities are very close.

High quality wines have low chlorides values.

The plot shows that there is no linear relationship between free.sulfur.dioxide and quality.

I can barely see that high quality wines have lower total.sulfur.dioxide values.

Low qualities have high density values while high qualities have lower ones, so we can say that high quality wine has low density.

It seems that high quality wines have low pH values.

High quality wines have higher values of sulphates than low quality wines.

High quality wines have high values of total.fixed.acids.

Bad and fair wines have the same median alcohol value, but good wines have a higher median.

This plot shows that good wines have high values of both fixed.acidity and citric.acid, and low values of volatile.acidity.

## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.

The plot suggests a curved, roughly exponential relationship between total.fixed.acids and pH.

The plot reveals a linear relationship between total.fixed.acids and fixed.acidity.
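Because pH is a logarithmic measure, one hedged way to probe a curved relationship like this is to compare the correlation of pH with total.fixed.acids on the raw scale and on the log10 scale. The sketch below does this for the first 10 rows from the `str()` output; it is an illustration of the check, not a result from the full dataset.

```r
# Toy sample: first 10 rows from the str() output.
fixed_acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)
citric_acid   <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)
pH            <- c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)

total_fixed_acids <- fixed_acidity + citric_acid

# Correlation on the raw scale and after a log10 transform of the acids.
r_raw <- cor(total_fixed_acids, pH)
r_log <- cor(log10(total_fixed_acids), pH)
print(c(raw = r_raw, log10 = r_log))

# Scatter plot with a log-scaled x axis; a straighter cloud supports the idea.
plot(total_fixed_acids, pH, log = "x",
     xlab = "total.fixed.acids (log scale)", ylab = "pH")
```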

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Quality correlates moderately with alcohol (0.476) and volatile.acidity (-0.391): high quality wines have high alcohol and low volatile.acidity.

Quality has a low positive correlation with fixed.acidity, citric.acid, sulphates and total.fixed.acids. It has a low negative correlation with density, total.sulfur.dioxide and chlorides.

Quality has almost no correlation with pH, residual.sugar and free.sulfur.dioxide.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Yes, I observed the relationship between total.fixed.acids and pH, which seems to be exponential with a strong negative correlation (-0.690). This is logical, as pH is a measure of acidity.

The relationship between fixed.acidity and total.fixed.acids also turned out to be linear.

What was the strongest relationship you found?

The relationship between fixed.acidity and total.fixed.acids (correlation 0.997).

Multivariate Plots Section

In this section I'll explore the most interesting variables that may affect quality, in conjunction with the quality and quality_rating variables.

This plot shows a weak negative correlation between volatile.acidity and alcohol. We can notice that good wines have high values of alcohol and low values of volatile.acidity.

There is a strong relationship between fixed.acidity and citric.acid. It is also clear that good wines have high values of both fixed.acidity and citric.acid.

The strong relationship between free.sulfur.dioxide and total.sulfur.dioxide is clear. We can also notice that free.sulfur.dioxide has almost no effect on quality: all quality ratings cover almost the same range of free.sulfur.dioxide values. For total.sulfur.dioxide, a difference between the quality ratings is barely visible.

This plot shows the relationships between pH and both fixed.acidity and total.fixed.acids, which seem to be strong.

As fixed.acidity or total.fixed.acids increases, pH decreases. It is also clear that good wines have high values of fixed.acidity and total.fixed.acids, and low values of pH.

There is a weak correlation between alcohol and pH. Good wines have higher alcohol and lower pH values than bad and fair wines.

This plot shows no relationship between alcohol and total.fixed.acids, but it reveals that good wines have high values of both.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

By faceting the plots by quality rating, I can visualize the relationships between many variables and their impact on quality.

Starting with alcohol, which has the highest correlation with quality, I notice that as alcohol increases, volatile.acidity decreases and wine quality increases. This makes sense because we know from the variable description that good wines have low volatile.acidity.

There is no relationship between alcohol and total.fixed.acids, but both variables correlate with quality. Good wines have high values of alcohol and total.fixed.acids.

fixed.acidity and citric.acid are correlated with each other and have a small impact on quality. Good wines have high values of both fixed.acidity and citric.acid.

free.sulfur.dioxide and total.sulfur.dioxide are strongly correlated. We can see no impact of free.sulfur.dioxide on quality, and total.sulfur.dioxide has only a slight relationship with it (the correlation is -0.185, so weakly negative).

Finally, pH has a strong negative correlation with both fixed.acidity and total.fixed.acids, which makes sense because pH measures acidity and fixed.acidity is a component of total.fixed.acids. pH also has a weak correlation with alcohol.

Were there any interesting or surprising interactions between features?

From googling the pH variable I learned that it is considered almost the backbone of wine quality, but surprisingly I found an extremely weak relationship between quality and pH.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

No

Final Plots and Summary

Plot One

Description One

This plot shows wine quality, which takes ordered discrete values from 3 to 8. The bars are filled by quality rating, which groups the quality values into 3 categories [bad, fair, and good].

We can see that the most common rating in the dataset is fair [5 and 6], then good [7 and 8], and the least common is bad [3 and 4].

Plot Two

Description Two

This plot shows the impact of acids on quality. We can clearly see that good wines have low values of volatile.acidity and high values of both fixed.acidity and citric.acid.

Plot Three

Description Three

This plot shows the impact of alcohol on quality: good wines have high alcohol values.

Reflection

This dataset contains 1599 observations of 13 variables (15 after adding the two I created), including the quality variable, which is my feature of interest. I began by exploring all the variables individually.

After that I created a new categorical variable ‘quality_rating’, which describes quality in meaningful terms rather than numbers. I also combined the fixed acid variables (fixed.acidity and citric.acid) into one variable, total.fixed.acids.

I explored the quality variable across all the other variables to understand the impact of the variables on the quality.

I was able to identify that the main features affecting quality are alcohol and the acids. There are also other features with a low impact, like sulphates, density and pH.

I would be interested in creating a linear model and testing its accuracy.
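As a starting point for that future work, a linear model could be sketched with `lm()`. No model was built in this analysis, so everything below is an assumption for illustration: the choice of predictors (alcohol and volatile.acidity, the two strongest correlates of quality) and the toy data (first 10 rows from the `str()` output) are only there to show the shape of the code.

```r
# Toy sample: first 10 rows from the str() output (far too small for a real model).
wine <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Ordinary least squares with the two assumed predictors.
fit <- lm(quality ~ alcohol + volatile.acidity, data = wine)
print(coef(fit))

# R-squared as a first, rough measure of fit on the training data.
r2 <- summary(fit)$r.squared
print(r2)
```

On the full dataset, a held-out test set would give a fairer estimate of accuracy than the in-sample R-squared shown here.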